2. Feature Engineering

Python

Machine Learning

Data Visualization

Published

September 21, 2025

Brief Look at the Dataset

First we need to load our dataframe from the csv file we created in part 1. Then, let's take a look at all the columns in the dataset.

import pandas as pd
import ast

df = pd.read_csv('data/first_gen_pokemon_cards.csv')

columns_to_parse = ['weaknesses', 'resistances', 'subtypes', 'types', 'abilities', 'attacks', 'nationalPokedexNumbers', 'evolvesTo', 'rules']
for col in columns_to_parse:
    if col in df.columns:
        df[col] = df[col].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) and x != 'nan' and pd.notna(x) else ([] if col != 'nationalPokedexNumbers' else None))

print(df.columns)

Index(['id', 'name', 'supertype', 'subtypes', 'level', 'hp', 'types',
       'evolvesFrom', 'abilities', 'attacks', 'weaknesses', 'retreatCost',
       'convertedRetreatCost', 'number', 'artist', 'rarity', 'flavorText',
       'nationalPokedexNumbers', 'legalities', 'images', 'evolvesTo',
       'resistances', 'rules', 'regulationMark', 'ancientTrait'],
      dtype='object')
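To make the parsing step above easier to follow, here is a minimal sketch of what happens to a single cell. The helper name `parse_cell` and its `missing` parameter are my own illustration and not part of the original code; the sample strings mimic how a list column looks after a round trip through a csv file.

```python
import ast

def parse_cell(cell, missing=None):
    """Parse a stringified Python literal; return `missing` for empty cells."""
    # Missing csv cells arrive as float NaN (or the string 'nan'), not as a literal.
    if isinstance(cell, str) and cell != "nan":
        return ast.literal_eval(cell)
    return missing

# A stringified list of dicts becomes a real list of dicts again.
print(parse_cell("[{'type': 'Psychic', 'value': '-30'}]"))
# A missing cell falls back to the value we choose (an empty list here).
print(parse_cell(float("nan"), missing=[]))
```

`ast.literal_eval` only evaluates Python literals (lists, dicts, strings, numbers), so it is a safer choice than `eval` for data read from a file.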

I took these columns and created a simple data dictionary for reference:

Column Name	Data Type	Description	Allowed Values	Examples	Missing Values
id	String	Unique identifier for each card	Alphanumeric strings	“xy7-54”, “sm3-22”	No
name	String	Name of the Pokemon card	Alphanumeric strings	“Pikachu”, “Charizard”	No
supertype	String	Broad category of the card	“Pokémon”	“Pokémon”	No
subtypes	Array of strings	More specific categories within the supertype	Strings	[“Basic”, “Stage 1”, “Stage 2”, “EX”, “Team Plasma”…]	No
level	String	Level of the Pokémon (if applicable)	Alphanumeric strings or X	“12”, “45”, “X”	Yes
hp	Integer	Hit points of the Pokémon	Positive integers	60, 120, 200	No
types	Array of strings	Types of the Pokémon	[“Fire”, “Water”, “Grass”, “Electric”, “Psychic”, “Fighting”, “Darkness”, “Metal”, “Fairy”, “Dragon”, “Colorless”]	[“Fire”], [“Water”, “Flying”]	No
evolvesFrom	String	Name of the Pokémon this card evolves from (if applicable)	Alphanumeric strings	“Pikachu”, “Charmander”	Yes
abilities	Array of objects	Special abilities of the Pokémon	Objects with name, text, and type fields	[{name: “Static”, text: “May paralyze opponent’s Pokémon”, type: “Poké-Body”}]	Yes
attacks	Array of objects	Attacks that the Pokémon can perform	Objects with name, cost, convertedEnergyCost, damage, and text fields	[{name: “Thunder Shock”, cost: [“Electric”, “Colorless”], convertedEnergyCost: 2, damage: “30”, text: “May paralyze opponent’s Pokémon”}]	Yes
weaknesses	Array of objects	Weaknesses of the Pokémon	Objects with type and value fields	[{type: “Fighting”, value: “×2”}]	Yes
retreatCost	Array of strings	Energy types required to retreat the Pokémon	[“Colorless”]	[“Colorless”, “Colorless”]	Yes
convertedRetreatCost	Integer	Total number of energy required to retreat the Pokémon	Non-negative integers	1, 2, 3	Yes
number	String	Card number within its set	Alphanumeric strings	“54”, “22”	No
artist	String	Name of the card’s illustrator	Alphanumeric strings	“Mitsuhiro Arita”, “5ban Graphics”	Yes
rarity	String	Rarity level of the card	“Common”, “Uncommon”, “Rare”, “Holo Rare”, “Ultra Rare”, “Secret Rare”, etc.	“Common”, “Holo Rare”	Yes
flavorText	String	Flavor text providing background or lore about the Pokémon	Alphanumeric strings	“When several of these Pokémon gather, their electricity could build and cause lightning storms.”	Yes
nationalPokedexNumbers	Array of integers	National Pokédex numbers associated with the Pokémon	Positive integers	[25], [6]	No
legalities	Object	Legality of the card in various formats	Fields for “expanded”, “standard”, “unlimited” with values “Legal” or “Not Legal”	{expanded: “Legal”, standard: “Not Legal”, unlimited: “Legal”}	No
images	Object	URLs for the card’s images	Fields for “small” and “large” with URL strings	{small: “http://…”, large: “http://…”}	No
evolvesTo	Array of strings	Names of Pokémon this card can evolve into (if applicable)	Alphanumeric strings	[“Raichu”, “Pikachu Libre”]	Yes
resistances	Array of objects	Resistances of the Pokémon	Objects with type and value fields	[{type: “Metal”, value: “-20”}]	Yes
rules	Array of strings	Special rules that apply to the card	Alphanumeric strings	[“If this Pokémon is your Active Pokémon, your opponent’s attacks do 20 less damage (before applying Weakness and Resistance).”]	Yes
regulationMark	String	Regulation mark for tournament legality	Single uppercase letters	“D”, “E”	Yes
ancientTrait	Object	Ancient Trait of the Pokémon (if applicable)	Object with name and text fields	{name: “Delta Evolution”, text: “This Pokémon can evolve from any type of basic Pokémon.”}	Yes

We can see that quite a few features are not necessary. The obvious ones are id and images, since these are unique identifiers and URLs, so we can drop them from the dataframe. Now we can focus on the features that would help a model learn the game mechanics that determine the hit points of a pokemon card. Given this goal, we can also drop the legalities and regulationMark columns, since they pertain to the card game's rules and not to the pokemon card itself. Finally, we can drop the supertype column, since all of the cards in our dataset share the same supertype, Pokémon.

Some of the remaining columns may also turn out to be of little use for predicting the hit points of a pokemon card, but that is hard to tell without running some analysis.

Feature Engineering

I looked at all the columns in the dataset and decided on the following feature engineering steps:

Column Name	Feature Engineering Steps
`id`	We will drop this column since it is a unique identifier and does not provide any useful information for predicting hit points.
`images`	We will drop this column since it contains URLs to images and does not provide any useful information for predicting hit points.
`legalities`	We will drop this column since it pertains to the card game rules and not the pokemon card itself.
`regulationMark`	We will drop this column since it pertains to the card game rules and not the pokemon card itself.
`supertype`	We will drop this column since all of the cards in our dataset are of the same supertype `Pokémon`.
`hp`	This is our target variable that we are trying to predict. We don’t need to do any feature engineering on this column.
`level`	Most of the values in this column are missing, but we can fill in the missing values with the median level and create a new feature indicating whether the level was missing or not.
`nationalPokedexNumbers`	We will convert this to a numerical value by taking the first number in the array.
`convertedRetreatCost`	This is already a numerical value and can be used as is. We will fill in any missing values with 0.
`rarity`	We can use one-hot encoding to convert this categorical feature into multiple binary features.
`evolvesFrom`	We can create a new binary feature indicating whether the pokemon evolves from another pokemon or not. 0 for no and 1 for yes.
`evolvesTo`	As with `evolvesFrom`, we can create a new binary feature indicating whether the pokemon evolves to another pokemon or not. 0 for no and 1 for yes.
`subtypes`	We can use MultiLabelBinarizer to convert this list-based categorical feature into multiple binary features.
`types`	Similar to `subtypes`, we can use MultiLabelBinarizer to convert this list-based categorical feature into multiple binary features
`weaknesses`	We can extract three features from this column: - `weakness_types`: We can extract the types from the weaknesses and use MultiLabelBinarizer to convert this list-based categorical feature into multiple binary features. - `total_weakness_multiplier`: We can extract the multiplier values from the weaknesses and sum them up to create a new numerical feature (e.g. “×2” -> 2). - `total_weakness_modifier`: We can extract the modifier values from the weaknesses (e.g. “+20” -> 20) and sum them up to create a new numerical feature.
`resistances`	Similar to `weaknesses`, we can extract three features from this column: - `resistance_types`: We can extract the types from the resistances and use MultiLabelBinarizer to convert this list-based categorical feature into multiple binary features. - `total_resistance_modifier`: We can extract the modifier values from the resistances (e.g. “-20” -> -20) and sum them up to create a new numerical feature. - `total_resistance_multiplier`: We can extract the multiplier values from the resistances and sum them up to create a new numerical feature.
`retreatCost`	Since `convertedRetreatCost` already captures the size of this list, this column adds no new information and we can drop it.
`name`	This feature has a very high cardinality. Originally my idea was to count the number of times each name appears in the dataset and use that as a feature. However, we can already count this using the `nationalPokedexNumbers` feature since each pokemon name corresponds to a unique pokedex number. Therefore, we can drop this feature.
`artist`	Similar to `name`, we can count the number of times each artist appears in the dataset and use that as a feature.
`abilities`	I will split this into three features: - `ability_count`: The number of abilities the pokemon has. - `ability_text`: The combined text of all abilities. - `has_pokemon_power`: A binary feature indicating whether the pokemon has a Poké-Body or Poké-Power ability.
`attacks`	Similar to `abilities`, I will split this into three features: - `attack_count`: The number of attacks the pokemon has. - `max_damage`: The highest explicit damage value among the attacks. - `attack_cost`: The total converted energy cost of all attacks.
`rules`	I will create a binary feature indicating whether the pokemon has any special rules or not.
`ancientTrait`	I will create a binary feature indicating whether the pokemon has an ancient trait or not.
`flavorText`	I believe the flavor text does not provide any information that could help us predict the HP of a pokemon card, but let's run TfidfVectorizer on it and analyze the result to see.

Let's also take a brief look at the data in our dataframe before we proceed with the feature engineering steps.

print(df.shape)
df.head(3)

(4470, 25)

	id	name	supertype	subtypes	level	hp	types	evolvesFrom	abilities	attacks	...	rarity	flavorText	nationalPokedexNumbers	legalities	images	evolvesTo	resistances	rules	regulationMark	ancientTrait
0	base1-1	Alakazam	Pokémon	[Stage 2]	42	80	[Psychic]	Kadabra	[{'name': 'Damage Swap', 'text': 'As often as ...	[{'name': 'Confuse Ray', 'cost': ['Psychic', '...	...	Rare Holo	Its brain can outperform a supercomputer. Its ...	[65]	{'unlimited': 'Legal'}	{'small': 'https://images.pokemontcg.io/base1/...	[]	[]	[]	NaN	NaN
1	base1-2	Blastoise	Pokémon	[Stage 2]	52	100	[Water]	Wartortle	[{'name': 'Rain Dance', 'text': 'As often as y...	[{'name': 'Hydro Pump', 'cost': ['Water', 'Wat...	...	Rare Holo	A brutal Pokémon with pressurized water jets o...	[9]	{'unlimited': 'Legal'}	{'small': 'https://images.pokemontcg.io/base1/...	[]	[]	[]	NaN	NaN
2	base1-3	Chansey	Pokémon	[Basic]	55	120	[Colorless]	NaN	[]	[{'name': 'Scrunch', 'cost': ['Colorless', 'Co...	...	Rare Holo	A rare and elusive Pokémon that is said to bri...	[113]	{'unlimited': 'Legal'}	{'small': 'https://images.pokemontcg.io/base1/...	[Blissey]	[{'type': 'Psychic', 'value': '-30'}]	[]	NaN	NaN

3 rows × 25 columns

Cleaning the Data

In this section we will focus on dropping columns and extracting features from our initial list of features. We will then transform and scale them in the next section. Let's first drop the columns that we decided aren’t useful:

id
images
legalities
regulationMark
supertype
retreatCost
name

df.drop(columns=['id', 'images', 'legalities', 'regulationMark', 'supertype', 'retreatCost', 'name'], inplace=True)

Direct Numerical Features

We can start with the columns that are already numerical values. These columns are:

level: I am replacing the X found in some levels with 100, which is the highest level you can train a pokemon to in a game. I will fill in the missing values later.

df['level'] = df['level'].apply(
  lambda x: int(x.replace('X', '100')) if isinstance(x, str) and x != 'nan' and pd.notna(x) else None
)

level_was_missing: A binary feature indicating whether the level was missing or not.

df['level_was_missing'] = df['level'].isnull().astype(int)
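The plan for level is to record which values were missing and later fill the gaps with the median. Here is a minimal sketch of that idea on a toy series; the numbers are made up for illustration, and in a real pipeline the median would usually be computed on the training split only.

```python
import pandas as pd

# Toy levels; None marks cards that have no printed level (hypothetical values).
levels = pd.Series([42.0, None, 55.0, None, 100.0], name="level")

# Record missingness first, otherwise the information is lost after filling.
level_was_missing = levels.isna().astype(int)

# Series.median() skips NaN by default, so it is the median of 42, 55 and 100.
filled = levels.fillna(levels.median())

print(level_was_missing.tolist())
print(filled.tolist())
```

Keeping the indicator next to the imputed value lets a model treat "no level printed" differently from "level happens to equal the median".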

nationalPokedexNumbers: We will convert this to a numerical value by taking the first number in the array.

df['primary_pokedex_number'] = df['nationalPokedexNumbers'].apply(
    lambda x: x[0] if isinstance(x, list) and len(x) > 0 else None
)

pokemon_count: Counts how many Pokemon are in the nationalPokedexNumbers array.

df['pokemon_count'] = df['nationalPokedexNumbers'].apply(
    lambda x: len(x) if isinstance(x, list) else 0
)

convertedRetreatCost: This is already a numerical value and can be used as is. We just need to fill in any missing values with 0.

df['convertedRetreatCost'] = df['convertedRetreatCost'].fillna(0)
df['convertedRetreatCost'] = df['convertedRetreatCost'].replace('.', 0).astype(int)

number: We will convert this to a numerical value by extracting the first run of digits and ignoring any non-numeric characters around it. For example, “54a” would be converted to 54.

import re

df['number'] = df['number'].apply(
    lambda x: int(re.search(r'\d+', str(x)).group()) if pd.notna(x) and re.search(r'\d+', str(x)) else None
)
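To show what the regular expression does, here is a small standalone sketch. The helper name `first_number` and the sample card numbers are assumptions for illustration only.

```python
import re

def first_number(card_number):
    """Return the first run of digits in a card number as an int, or None."""
    match = re.search(r"\d+", str(card_number))
    return int(match.group()) if match else None

# Hypothetical card numbers with letters before or after the digits.
for raw in ["54", "54a", "SM210", "?"]:
    print(raw, "->", first_number(raw))
```

Calling `re.search` once and reusing the match object avoids scanning each string twice.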

Simple Categorical Features

Next we can look at the simple categorical features that have a limited number of unique values:

rarity: We can use one-hot encoding to convert this categorical feature into multiple binary features.

from sklearn.preprocessing import OneHotEncoder

hot_encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
df['rarity'] = df['rarity'].fillna('Unknown')
rarity_encoded = hot_encoder.fit_transform(df[['rarity']])
rarity_encoded_df = pd.DataFrame(rarity_encoded, columns=hot_encoder.get_feature_names_out(['rarity']))
df = pd.concat([df, rarity_encoded_df], axis=1)
df.drop(columns=['rarity'], inplace=True)
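If it helps to see the shape of the result, here is a toy sketch of the same idea using `pd.get_dummies`, which I am swapping in for OneHotEncoder only because it keeps the example short. The rarity values are a small hypothetical sample.

```python
import pandas as pd

# Toy rarity column with one missing value (hypothetical sample).
toy = pd.DataFrame({"rarity": ["Common", None, "Rare Holo", "Common"]})

# Same treatment as above: missing rarities get their own 'Unknown' category.
toy["rarity"] = toy["rarity"].fillna("Unknown")

# One binary column per category; every row has exactly one 1.
dummies = pd.get_dummies(toy["rarity"], prefix="rarity", dtype=int)
print(dummies)
```

Unlike `get_dummies`, a fitted OneHotEncoder remembers its categories, which matters when the same encoding has to be applied to new data later.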

evolvesFrom: We can create a new binary feature indicating whether the pokemon evolves from another pokemon or not. 0 for no and 1 for yes.

df['evolvesFrom'] = df['evolvesFrom'].notnull().astype(int)

evolvesTo: As with evolvesFrom, we can create a new binary feature indicating whether the pokemon evolves to another pokemon or not. 0 for no and 1 for yes.

df['evolvesTo'] = df['evolvesTo'].apply(lambda x: int(isinstance(x, list) and len(x) > 0))

List-Based Categorical Features

Next we can look at the list-based categorical features. We will need to extract the modifiers from weaknesses and resistances, so we first define a helper function for that and then proceed with the feature extraction.

def extract_modifiers(modifier_list):
    """Return (total_multiplier, total_modifier), each summed over the list."""
    if not isinstance(modifier_list, list):
        return (0, 0)

    total_multiplier = 0
    total_modifier = 0

    for item in modifier_list:
        value_str = item['value'].strip()

        if '×' in value_str:
            # Multiplicative value such as "×2"
            total_multiplier += int(value_str.replace('×', ''))
        elif '+' in value_str or '-' in value_str:
            # Additive value such as "+20" or "-30"; int() handles the sign
            total_modifier += int(value_str)

    return (total_multiplier, total_modifier)
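Here is a small standalone sketch of how the two kinds of values are told apart. The helper `classify_value` and the sample entries are hypothetical; it mirrors the logic of the function above rather than replacing it.

```python
def classify_value(value):
    """Classify a weakness/resistance value string as a multiplier or a modifier."""
    value = value.strip()
    if "×" in value:
        return ("multiplier", int(value.replace("×", "")))
    if value.startswith(("+", "-")):
        # int() accepts a leading sign, so "+20" -> 20 and "-30" -> -30.
        return ("modifier", int(value))
    return ("unknown", 0)

# Hypothetical weaknesses list with one multiplier and one additive modifier.
sample = [{"type": "Fighting", "value": "×2"}, {"type": "Fire", "value": "+20"}]

totals = {"multiplier": 0, "modifier": 0, "unknown": 0}
for entry in sample:
    kind, amount = classify_value(entry["value"])
    totals[kind] += amount

print(totals)
```

Both totals are sums, so a card with no entries of a given kind ends up with 0 for that feature.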

subtypes: We can use MultiLabelBinarizer to convert this list-based categorical feature into multiple binary features.

from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()

subtypes_encoded = mlb.fit_transform(df['subtypes'].fillna('None').apply(lambda x: x if isinstance(x, list) else [x]))
subtypes_encoded_df = pd.DataFrame(subtypes_encoded, columns=[f'subtype_{cls}' for cls in mlb.classes_])
df = pd.concat([df, subtypes_encoded_df], axis=1)
df.drop(columns=['subtypes'], inplace=True)

types: Similar to subtypes, we can use MultiLabelBinarizer to convert this list-based categorical feature into multiple binary features.

types_encoded = mlb.fit_transform(df['types'].fillna('None').apply(lambda x: x if isinstance(x, list) else [x]))
types_encoded_df = pd.DataFrame(types_encoded, columns=[f'type_{cls}' for cls in mlb.classes_])
df = pd.concat([df, types_encoded_df], axis=1)
df.drop(columns=['types'], inplace=True)
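For readers who have not used MultiLabelBinarizer before, here is a pure-Python sketch of the encoding it produces. This is only an illustration of the idea, not scikit-learn's implementation, and the function name `multi_hot` is my own.

```python
def multi_hot(rows):
    """Encode a list of label lists as a sorted class list plus a 0/1 matrix."""
    classes = sorted({label for labels in rows for label in labels})
    matrix = [[int(cls in labels) for cls in classes] for labels in rows]
    return classes, matrix

# Toy subtypes column: a card can carry zero, one, or several labels.
classes, matrix = multi_hot([["Stage 2"], ["Basic"], ["Basic", "EX"], []])

print(classes)
for row in matrix:
    print(row)
```

Each label gets its own column, and a card with several labels simply has several 1s in its row, which is what one-hot encoding cannot express.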

weaknesses: We can extract three features from this column:
- weakness_types: We can extract the types from the weaknesses and use MultiLabelBinarizer to convert this list-based categorical feature into multiple binary features.
- total_weakness_multiplier: We can extract the multiplier values from the weaknesses and sum them up to create a new numerical feature (e.g. “×2” -> 2).
- total_weakness_modifier: We can extract the modifier values from the weaknesses (e.g. “+20” -> 20) and sum them up to create a new numerical feature.

mlb_weakness = MultiLabelBinarizer()

weakness_encoded = mlb_weakness.fit_transform(
    df['weaknesses'].apply(
        lambda x: [w['type'] for w in x] if isinstance(x, list) else []
    )
)
weakness_encoded_df = pd.DataFrame(weakness_encoded, columns=[f'weakness_{cls}' for cls in mlb_weakness.classes_])
df = pd.concat([df, weakness_encoded_df], axis=1)

total_weakness_values = df['weaknesses'].apply(extract_modifiers)
df[['total_weakness_multiplier', 'total_weakness_modifier']] = pd.DataFrame(
  total_weakness_values.tolist(), 
  index=df.index
)
df.drop(columns=['weaknesses'], inplace=True)

resistances: Similar to weaknesses, we can extract three features from this column:
- resistance_types: We can extract the types from the resistances and use MultiLabelBinarizer to convert this list-based categorical feature into multiple binary features.
- total_resistance_modifier: We can extract the modifier values from the resistances (e.g. “-20” -> -20) and sum them up to create a new numerical feature.
- total_resistance_multiplier: We can extract the multiplier values from the resistances and sum them up to create a new numerical feature.

mlb_resistance = MultiLabelBinarizer()

resistance_encoded = mlb_resistance.fit_transform(
    df['resistances'].apply(
        lambda x: [r['type'] for r in x] if isinstance(x, list) else []
    )
)
resistance_encoded_df = pd.DataFrame(resistance_encoded, columns=[f'resistance_{cls}' for cls in mlb_resistance.classes_])
df = pd.concat([df, resistance_encoded_df], axis=1)

total_resistance_values = df['resistances'].apply(extract_modifiers)
df[['total_resistance_multiplier', 'total_resistance_modifier']] = pd.DataFrame(
    total_resistance_values.tolist(),
    index=df.index
)
df.drop(columns=['resistances'], inplace=True)

High-Cardinality Categorical Features

Let's take a look at the categorical features that have a high number of unique values:

pokedex_frequency: We can count the number of times each pokedex number appears in the dataset and use that as a feature.

# Convert lists to tuples (hashable) for frequency counting
pokedex_keys = df['nationalPokedexNumbers'].apply(
    lambda x: tuple(x) if isinstance(x, list) else None
)
# Map each card's key to the number of cards sharing that key
df['pokedex_frequency'] = pokedex_keys.map(pokedex_keys.value_counts())
df.drop(columns=['nationalPokedexNumbers'], inplace=True)

artist: We can count the number of times each artist appears in the dataset and use that as a feature.

df['artist_frequency'] = df['artist'].map(df['artist'].value_counts())
df.drop(columns=['artist'], inplace=True)
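Here is a minimal sketch of frequency encoding on a toy column, including what happens to missing values. The artist names are placeholders, and filling the gaps with 0 is an assumption on my part rather than something the steps above do.

```python
import pandas as pd

# Toy artist column with one missing value (placeholder names).
artists = pd.Series(["Artist A", "Artist B", "Artist A", None])

# value_counts() ignores missing values, so they map to NaN here.
frequency = artists.map(artists.value_counts())

# Assumption: treat an unknown artist as having zero known cards.
frequency_filled = frequency.fillna(0).astype(int)

print(frequency_filled.tolist())
```

Because of those NaN values the frequency column comes out as float, which explains values like 471.0 in the final dataframe.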

Complex JSON/Text Features

Finally, we have the more complex features that are in JSON format or text:

abilities: I will split this into three features:
- ability_count: The number of abilities the pokemon has.
- ability_text: The combined text of all abilities.
- has_pokemon_power: A binary feature indicating whether the pokemon has a Poké-Body or Poké-Power ability.

df['ability_count'] = df['abilities'].apply(lambda x: len(x) if isinstance(x, list) else 0)
df['ability_text'] = df['abilities'].apply(lambda x: ' '.join([ability['text'] for ability in x]) if isinstance(x, list) else '')
# The Poké-Body / Poké-Power label is stored in each ability's 'type' field, not its 'name'
df['has_pokemon_power'] = df['abilities'].apply(lambda x: int(any(ability.get('type') in ['Poké-Body', 'Poké-Power'] for ability in x)) if isinstance(x, list) else 0)
df.drop(columns=['abilities'], inplace=True)
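To check the three ability features on a single card, here is a small standalone sketch. The helper `ability_features` is hypothetical, and the sample record follows the shape given in the data dictionary (name, text, and type fields).

```python
POWER_TYPES = {"Poké-Body", "Poké-Power"}

def ability_features(abilities):
    """Compute the three ability features for one card's abilities list."""
    if not isinstance(abilities, list):
        abilities = []
    return {
        "ability_count": len(abilities),
        "ability_text": " ".join(a.get("text", "") for a in abilities),
        # The label lives in the 'type' field of each ability object.
        "has_pokemon_power": int(any(a.get("type") in POWER_TYPES for a in abilities)),
    }

sample = [{"name": "Static", "text": "May paralyze opponent's Pokémon", "type": "Poké-Body"}]
print(ability_features(sample))
print(ability_features(None))
```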

attacks: Similar to abilities, I will split this into three features:
- attack_count: The number of attacks the pokemon has.
- max_damage: The explicit maximum damage value among all attacks. Since some damage values contain non-numeric characters (e.g., “30+”, “50×”), we strip those characters and convert the rest to an integer. If the damage field has no numeric value, we fall back to the largest number found in the attack text. In the future we can also consider more complex parsing methods to better estimate the maximum damage.
- attack_cost: The total converted energy cost of all attacks.

df['attack_count'] = df['attacks'].apply(lambda x: len(x) if isinstance(x, list) else 0)

def extract_max_damage(attacks):
    if not isinstance(attacks, list) or len(attacks) == 0:
        return 0
    
    damages = []
    
    for attack in attacks:
        if isinstance(attack.get('damage'), str):
            damage_str = attack['damage'].replace('+', '').replace('-', '').replace('×', '').strip()
            if damage_str.isdigit():
                damages.append(int(damage_str))
                continue
            
        if isinstance(attack.get('text'), str):
            numbers = re.findall(r'\b(\d+)\b', attack['text'])
            if numbers:
                damages.append(max(int(num) for num in numbers))
    
    return max(damages, default=0)

df['max_damage'] = df['attacks'].apply(extract_max_damage)

df['attack_cost'] = df['attacks'].apply(lambda x: sum([len(attack['cost']) for attack in x]) if isinstance(x, list) else 0)
df.drop(columns=['attacks'], inplace=True)
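As a possible alternative to stripping characters one by one, the damage field could be parsed with a single regular expression. This is only a sketch under that assumption; `parse_damage` is a hypothetical helper and is not used in the code above.

```python
import re

def parse_damage(damage):
    """Return the first number in a damage string, or None if there is none."""
    match = re.search(r"\d+", damage or "")
    return int(match.group()) if match else None

# Damage strings in the styles mentioned above, plus empty and missing values.
for raw in ["30", "30+", "50×", "", None]:
    print(repr(raw), "->", parse_damage(raw))
```

Returning None for attacks without a damage number keeps "no damage listed" separate from a real value, so the fallback to the attack text can be applied only where it is needed.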

rules: This can be converted into a binary feature indicating whether the pokemon has any special rules or not.

df['has_rules'] = df['rules'].apply(lambda x: int(isinstance(x, list) and len(x) > 0))
df.drop(columns=['rules'], inplace=True)

ancientTrait: This can also be converted into a binary feature indicating whether the pokemon has an ancient trait or not.

df["has_ancient_trait"] = df['ancientTrait'].apply(lambda x: int(isinstance(x, dict)))
df.drop(columns=['ancientTrait'], inplace=True)

What Our New Dataset Looks Like

After performing all the feature engineering steps, let's take a look at the first few rows of our new dataframe to see what it looks like now.

print(df.shape)
df.head(3)

(4470, 111)

	level	hp	evolvesFrom	convertedRetreatCost	number	flavorText	evolvesTo	primary_pokedex_number	pokemon_count	...	pokedex_frequency	artist_frequency	ability_count	ability_text	attack_count	max_damage	attack_cost
0	42.0	80	1	3	1	Its brain can outperform a supercomputer. Its ...	0	65	1	...	29	471.0	1	As often as you like during your turn (before ...	1	30	3
1	52.0	100	1	3	2	A brutal Pokémon with pressurized water jets o...	0	9	1	...	41	471.0	1	As often as you like during your turn (before ...	1	40	3
2	55.0	120	0	1	3	A rare and elusive Pokémon that is said to bri...	1	113	1	...	25	471.0	0		2	80	6

3 rows × 111 columns

We can see that we have successfully transformed our original dataframe into a more structured format that is better suited to machine learning models. From the 25 original columns, we now have 111 columns (including the hp target) that capture various aspects of the pokemon cards. Note that flavorText and ability_text still hold raw text at this point. We save this new dataframe to a csv file for future use.

df.to_csv('data/processed_pokemon_cards.csv', index=False)
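Before moving on, it can be useful to list which columns are still non-numeric, since those will need further work before modeling. Here is a minimal sketch on a toy frame; the column values are made up and only the column names come from the steps above.

```python
import pandas as pd

# Toy frame with a mix of numeric and raw-text columns (hypothetical values).
toy = pd.DataFrame({
    "hp": [80, 120],
    "flavorText": ["some lore text", "more lore text"],
    "ability_text": ["some ability text", ""],
    "max_damage": [30, 80],
})

# Everything that is not a number still needs encoding or vectorizing.
text_columns = toy.select_dtypes(exclude="number").columns.tolist()
print(text_columns)
```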